Human resources are the most valuable asset of any country¹ and a main reason behind the success or failure of any organization. An educated and competent workforce is a key driver of economic and social development, which makes the importance of academic education undeniable. It is therefore worth investing time and money in studying students’ academic performance and in finding effective ways to improve it.
Because of its importance, the topic has received particular attention in past research, and many studies have analyzed the factors affecting students’ academic performance. While some studies focused on psychological variables, such as the research of Franck Amadieu & André Tricot², others looked at the impact of elements such as mobility³, gender and other socio-economic factors on students’ academic success.
Several reasons motivated our choice of topic. As students, we are passionate about education, and we want this project to provide a detailed analysis that can serve as a reference for decision-makers in the field. Mainly, we want to help schools and universities better understand the factors influencing students’ academic performance, so that they can improve their decision-making, their students’ success rate and eventually their overall organization.
Sources: ¹ Jean-Marie Peretti, Gestion des ressources humaines, 2004. ² Psychological factors which have an effect on student success, 2015. ³ La migration pour études : Regards d’intervenants sur l’accueil et l’intégration des nouveaux étudiants, 2009.
The aim of the project is to understand the evolution of secondary academic performance in France. Our study mainly focuses on students in 3e (“troisième”, equivalent to 11th grade in Switzerland) and their results on the Diplôme National du Brevet (DNB) by school.
First, we will observe whether DNB admissions have improved or, on the contrary, deteriorated over the years. From this data set, we will also make comparisons, particularly at the geographical level, and analyse the success rate by distinction for each school.
Then, we will try to understand whether academic success is correlated with some socio-economic factors, such as the type of accommodation, the rate of single-parent families, and the involvement of schools in students’ physical and sports practice. Finally, independently of these factors, we will investigate whether the COVID-19 pandemic has had a direct negative impact on students’ school performance.
What is the evolution of student performance over time and across the different regions/departments of France?
Do socio-economic factors such as the type of accommodation, family situation or school policies influence student success?
Has the COVID-19 pandemic impacted student performance?
This dataset presents the results of the “diplôme national du brevet” by school, for schools in metropolitan France and for the overseas departments and regions. This data set contains 139’580 observations.
| Variable | Meaning |
|---|---|
| session | Year of the exam session |
| school_id | School identification number |
| school_type | School type divided in six categories: COLLEGE, LYCEE PROFESSIONNEL, LYCEE, EREA, CFA, and AUTRE |
| establishment_name | Name of the establishment |
| education_sector | Education sector categorised as public or private |
| municipality_code | Municipality code |
| municipality | Name of the municipality |
| department_code | Department code. Note that France has 101 departments. |
| department | Name of the department |
| academy_code | Academy code |
| academy_name | Name of the academy |
| region_code | Region code. Note that France has 18 administrative regions. |
| region | Name of the region |
| registered | Registered candidates |
| present | Candidates present for the exam |
| admitted | Candidates admitted |
| admitted_without | Candidates admitted without distinction |
| admitted_AB | Candidates admitted with distinction “Assez Bien” |
| admitted_B | Candidates admitted with distinction “Bien” |
| admitted_TB | Candidates admitted with distinction “Très bien” |
| success_rate_pct | Success rate as a percentage: admitted candidates divided by candidates present |
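
As a worked illustration of how the count columns relate to the rates, the sketch below recomputes a success rate and the share of each distinction for a single school. The numbers are made up, and reading the success rate as admitted divided by present is our interpretation of the variable; the distinction shares mirror the percentage columns derived later in the report.

```r
# Hypothetical counts for one school and one session (not taken from the data set).
school <- data.frame(
  present = 120, admitted = 108,
  admitted_without = 27, admitted_AB = 30, admitted_B = 28, admitted_TB = 23
)

# Success rate: admitted candidates as a percentage of candidates present.
success_rate_pct <- school$admitted / school$present * 100

# Share of each distinction among admitted candidates, in percent.
shares <- c(
  without = school$admitted_without,
  AB = school$admitted_AB,
  B = school$admitted_B,
  TB = school$admitted_TB
) / school$admitted * 100

print(success_rate_pct)
print(round(shares, 1))
```

Because every admitted candidate falls into exactly one distinction category, the four shares add up to 100.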
This data set gathers all schools which have been awarded the “Generation 2024” label. The objective of this label, developed in view of the Paris 2024 Olympic Games, is to develop bridges between the school world and the sports movement in order to encourage young people to take part in physical activity and sport. This data set contains 6’883 observations.
| Variable | Meaning |
|---|---|
| region | Name of the region |
| academy | Name of the academy |
| department | Name of the department |
| municipality | Name of the municipality |
| establishment | Name of the establishment |
| school_id | School identification number |
| school_type | School type |
| education_sector | Education sector categorised as public or private |
| postcode | Postcode |
| adress | Address of the establishment |
| adress_2 | Additional address of the establishment |
| mail | E-mail address of the establishment |
| students | Number of students in the school |
| priority_education | Indicates whether the school is located in a priority education network (REP) or a reinforced priority education network (REP+) |
| city_school | Indicates whether the school is part of a city school |
| QPV | Position relative to a priority neighbourhood of the city policy. It is a policy aimed at compensating for differences in living standards with the rest of the territory. |
| ULIS | Indicates whether the school offers a ULIS (Localized Unit for School Inclusion) |
| SEGPA | Indicates whether the school has a SEGPA (adapted general and vocational education sections) |
| sport_section | Indicates whether the school has a sports section |
| agricultural_high_school | Indicates whether the school is an agricultural high school |
| military_high_school | Indicates whether the school is a military high school |
| vocational_high_school | Indicates whether the establishment is labeled “vocational high school” |
| establishment_web | Url of the description of the establishment page on the ONISEP website |
| SIREN_SIRET | SIREN/SIRET number of the establishment. SIREN is the French business register identification system. |
| district | Name of the district to which the school is attached |
| ministry | Ministry responsible for the institution |
| label_start_date | Start date of the “generation 2024” label. Format yyyy/mm/dd |
| label_end_date | End date of the “generation 2024” label. Format yyyy/mm/dd |
| y_coordinate | Y coordinate of the establishment, using the EPSG coordinate system |
| x_coordinate | X coordinate of the establishment, using the EPSG coordinate system |
| epsg | EPSG code of the coordinate system used to locate the establishment |
| precision_on_localisation | Specification of the geographical location of the establishment |
| latitude | Latitude |
| longitude | Longitude |
| position | Geographical position |
| engaging_30_sport | Indicates whether the institution participates in the 30 minutes of daily physical activity programme |
This dataset records enrolment in secondary schools according to the type of accommodation for pupils: half-board, boarding school etc. This data set contains 32’096 observations.
| Variable | Meaning |
|---|---|
| year_back_to_school | Year of the start of the school year |
| Academic_region | Name of the academic region |
| academy | Name of the academy |
| department | Name of the department |
| municipality | Name of the municipality |
| school_id | School identification number |
| establishment_main_name | Main name of the establishment |
| establishment_name | Name of the establishment |
| school_type | School type |
| education_sector | Education sector categorised as public or private |
| students_secondary_education | Students in secondary education |
| students_higher_education | Number of students in higher education |
| external_students_secondary_education | External students in secondary education |
| half_boarders_students_secondary_education | Half-boarders in secondary education |
| boarding_students_secondary_education | Boarding students in secondary education |
| external_students_higher_education | External students in higher education |
| half_board_students_higher_education | Half-board students in higher education |
| boarding_students_higher_education | Boarding students in higher education |
This data set provides information about single-parent families in each municipality. The census has been conducted every five years since 2008. This data set contains 104’986 observations.
| Variable | Meaning |
|---|---|
| geocode | Geographical code from INSEE |
| municipality | Name of the municipality |
| year | Census year |
| sing_par | Single-parent families |
This is a time-based data set on the COVID tests carried out, and their results, by laboratories, hospitals, pharmacists, doctors and nurses. It is updated daily. On 11 October, the data set contained 543’974 observations.
| Variable | Meaning |
|---|---|
| department_code | Department code |
| test_week | Week of the tests, given as a year and a week number (e.g. 2020-S21) |
| educational_level | Description of the age group as [m-n], m and n being the lower and upper limits. |
| age_group | Label of the age group. The upper limit is given as n-1, except for the oldest group, where 18 is used. |
| pop | Population |
| positive | Daily patients testing positive |
| tested | Daily patients tested |
| incidence_rate | Incidence rate |
| positivity_rate | Positivity rate |
| screening_rate | Screening rate |
Loading of the data

All but the single_parent data set are CSV files with semicolons as separators. The single_parent data set is in Excel format, so we have to use read_excel. We also use skip because the sheet starts with extra rows of header information.
library(readr)   # read_delim
library(readxl)  # read_excel
DNB_par_etablissement <- read_delim(here::here("data/DNB-par-etablissement.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
Etablissements_labellises_generation_2024 <- read_delim(here::here("data/Etablissements-labellises-generation-2024.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
Hebergement_eleves_etablissements_2d <- read_delim(here::here("data/Hebergement-eleves-etablissements-2d.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
# The first four rows of the Excel sheet hold header information and are skipped.
insee_rp_hist_xxxx <- read_excel(here::here("data/insee_rp_hist_xxxx.xlsx"), skip = 4)
covid_sp_dep_heb_cage_scol_2022_11_30_19h01 <- read_delim(here::here("data/covid_sp_dep_heb_cage_scol_2022_11_30_19h01.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
We realised that some wrangling is necessary for each data set. We established a checklist that we go through for each of them.
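# Rename all columns of df with the names in x and return the result as a tibble.
# Stops with an error if x does not contain exactly one name per column.
rename_df <- function(df, x){
  if (ncol(df) != length(x)){
    stop("Vector is not the right length")
  }
  names(df) <- x
  as_tibble(df)
}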
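
To make the behaviour of the renaming helper concrete, here is a minimal sketch on a toy data frame. The helper is re-declared as `rename_df_base`, a hypothetical base-R variant without the tibble conversion, so that the example runs without extra packages; `toy_df` and its column names are made up.

```r
# Base-R variant of the renaming helper (no tibble conversion), for illustration only.
rename_df_base <- function(df, x) {
  if (ncol(df) != length(x)) {
    stop("Vector is not the right length")
  }
  names(df) <- x
  df
}

# Hypothetical toy data with French column headers.
toy_df <- data.frame(Session = c(2019, 2020), Admis = c(95, 102))

renamed <- rename_df_base(toy_df, c("session", "admitted"))
print(names(renamed))

# A vector of the wrong length is rejected instead of silently misnaming columns.
too_short <- tryCatch(
  rename_df_base(toy_df, c("session")),
  error = function(e) conditionMessage(e)
)
print(too_short)
```

Checking the length up front means that a source file gaining or losing a column is noticed at load time rather than through shifted column names later on.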
dnb_colnames <- c("session", "school_id", "school_type", "establishment_name", "education_sector", "municipality_code", "municipality", "department_code", "department", "academy_code", "academy_name", "region_code", "region", "registered", "present", "admitted", "admitted_without", "admitted_AB", "admitted_B", "admitted_TB", "success_rate_pct"
)
dnb_results <- rename_df(DNB_par_etablissement, dnb_colnames)
The success rate is stored as text of the form xx,xx%; we want it as a double of the form xx.xx.
dnb_results[["success_rate_pct"]] <- as.double(gsub("%","",
  gsub(",",".", dnb_results[["success_rate_pct"]])))
We create department_fr and drop the overseas departments and collectivities (DROM and COM).
dnb_results$department_fr <- stri_trans_general(dnb_results$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
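
The same renaming chain is applied to several data sets in this report, so it could be wrapped in one function. The sketch below is a hypothetical helper, `normalise_department`, and not the code used here: it assumes the names are already free of accents (the report removes them with `stri_trans_general`) and it approximates `str_to_title` with a base-R regular expression.

```r
# Hypothetical helper wrapping the renaming chain. It assumes ASCII input and
# approximates str_to_title with a base-R regex, so it is only a sketch.
normalise_department <- function(x) {
  x <- tolower(x)
  # Capitalise the first letter of each word (at the start, after a space or a hyphen).
  x <- gsub("(^|[ -])([a-z])", "\\1\\U\\2", x, perl = TRUE)
  # Lower-case the linking words and drop the apostrophe after "D".
  x <- gsub("Du", "du", x, fixed = TRUE)
  x <- gsub("De", "de", x, fixed = TRUE)
  x <- gsub("D'", "D", x, fixed = TRUE)
  x <- gsub("Et", "et", x, fixed = TRUE)
  x <- gsub(" ", "-", x, fixed = TRUE)
  # Manual corrections for names the generic rules get wrong.
  fixes <- c(
    "Corse-du-Sud" = "Corse du Sud",
    "deux-Sevres" = "Deux-Sevres",
    "Alpes-de-Hte-Provence" = "Alpes-de-Haute-Provence",
    "Territoire-de-Belfort" = "Territoire de Belfort",
    "Seine-Saint-denis" = "Seine-Saint-Denis"
  )
  for (wrong in names(fixes)) {
    x <- gsub(wrong, fixes[[wrong]], x, fixed = TRUE)
  }
  x
}

departments <- c("COTE D'OR", "DEUX-SEVRES", "TERRITOIRE DE BELFORT", "EURE-ET-LOIR")
print(normalise_department(departments))
```

Keeping the manual corrections in a single named vector makes it easier to apply exactly the same exceptions to every data set.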
# Names are matched after accent removal, hence "Francaise" without cedilla.
dnb_results <- dnb_results %>%
  dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "NA", "-"))
# Share of each distinction among admitted candidates, in percent.
dnb_results <- dnb_results %>%
  mutate(without_pct = admitted_without/admitted*100,
         AB_pct = admitted_AB/admitted*100,
         B_pct = admitted_B/admitted*100,
         TB_pct = admitted_TB/admitted*100
  )
We can see the final table dnb_results below.
est_24_names <- c("region", "academy", "department", "municipality", "establishment", "school_id", "school_type", "education_sector", "postcode", "adress", "adress_2", "mail", "students", "priority_education", "city_school", "QPV", "ULIS", "SEGPA", "sport_section", "agricultural_high_school", "military_high_school", "vocational_high_school", "establishment_web", "SIREN_SIRET", "district", "ministry", "label_start_date", "label_end_date", "y_coordinate", "x_coordinate", "epsg", "precision_on_localisation", "latitude", "longitude", "position", "engaging_30_sport")
establishment_24 <- rename_df(Etablissements_labellises_generation_2024, est_24_names)
# A label starting or ending after July is attached to the next exam session.
establishment_24 <- establishment_24 %>%
  mutate(session_started = case_when(month(label_start_date) <= 7 ~ year(label_start_date),
                                     month(label_start_date) > 7 ~ year(label_start_date)+1),
         session_ended = case_when(month(label_end_date) <= 7 ~ year(label_end_date),
                                   month(label_end_date) > 7 ~ year(label_end_date)+1)
  )
We create department_fr.
establishment_24$department_fr <- stri_trans_general(establishment_24$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
The map below shows that the data set contains establishments from the overseas territories as well as French international schools.
As previously discussed, we have decided to keep only data from metropolitan France. We have to make sure that we also remove the French international schools. We take this opportunity to remove unused variables as well.
establishment_24 <- establishment_24 %>%
  dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>%
  dplyr::filter(department_fr != "NA") # "NA" and "-" make sure that no international schools remain
# establishment_24 has many variables that we will not use
establishment_24 <- establishment_24 %>%
  select(-c(postcode:mail,city_school,QPV:SEGPA,establishment_web:ministry, precision_on_localisation))
We create a high_school_type variable which combines the information of agricultural_high_school, military_high_school and vocational_high_school.
establishment_24 <- establishment_24 %>%
mutate(high_school_type = case_when(agricultural_high_school == 1 ~ "agricultural high school",
military_high_school == 1 ~ "military high school",
vocational_high_school == 1 ~ "vocational high school")) %>%
  select(-c(agricultural_high_school, military_high_school, vocational_high_school))
We can see the final table establishment_24 below.
housing_names <- c("year_back_to_school", "Academic_region", "academy", "department", "municipality", "school_id", "establishment_main_name", "establishment_name", "school_type", "education_sector", "students_secondary_education", "students_higher_education", "external_students_secondary_education", "half_boarders_students_secondary_education", "boarding_students_secondary_education", "external_students_higher_education", "half_board_students_higher_education", "boarding_students_higher_education")
student_housing <- rename_df(Hebergement_eleves_etablissements_2d, housing_names)
We create a session variable, as year_back_to_school refers to the beginning of the school year and not to the exam session.
student_housing <- student_housing %>%
  mutate(session = year_back_to_school + 1) %>%
  select(year_back_to_school, session, everything()) # only to order the variables
We create department_fr.
student_housing$department_fr <- stri_trans_general(student_housing$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
sg_parent_names <- c("geocode", "department", "session","sing_par")
single_parent <- rename_df(insee_rp_hist_xxxx, sg_parent_names)
# session and sing_par are converted to doubles.
single_parent[["session"]] <- as.double(single_parent[["session"]])
single_parent[["sing_par"]] <- as.double(single_parent[["sing_par"]])
We create department_fr.
single_parent$department_fr <- stri_trans_general(single_parent$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
covid_names <- c("department_code", "test_week", "educational_level", "age_group", "pop", "positive", "tested", "incidence_rate", "positivity_rate", "screening_rate")
covid_in_schools <- rename_df(covid_sp_dep_heb_cage_scol_2022_11_30_19h01, covid_names)
# Decimal commas are replaced by points before converting to doubles.
covid_in_schools[["positive"]] <- as.double(gsub(",",".", covid_in_schools[["positive"]]))
covid_in_schools[["incidence_rate"]] <- as.double(gsub(",",".", covid_in_schools[["incidence_rate"]]))
covid_in_schools[["positivity_rate"]] <- as.double(gsub(",",".", covid_in_schools[["positivity_rate"]]))
# test_date: start of week 1 of the year plus the week number read from test_week.
covid_in_schools <- covid_in_schools %>%
mutate(test_date = case_when (as.numeric(substr(test_week, 1,4))== 2020
~ lubridate::ymd('2019-12-30') + lubridate::weeks(as.numeric(substr(test_week, 7,8))),
as.numeric(substr(test_week, 1,4))== 2021
~ lubridate::ymd('2021-01-04') + lubridate::weeks(as.numeric(substr(test_week, 7,8))),
as.numeric(substr(test_week, 1,4))== 2022
~ lubridate::ymd('2022-01-03') + lubridate::weeks(as.numeric(substr(test_week, 7,8)))
))
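
The week conversion above, together with the school-year rule used in this report (dates after July belong to the next exam session), can be sketched in base R as follows. The label format "2020-S21" is an assumption inferred from the character positions read by `substr`; `week1_start`, `week_to_date` and `session_of` are hypothetical names. The three start dates are the Mondays of ISO week 1, so the resulting date is the Monday that follows the labelled week.

```r
# Monday on which week 1 of each year starts, as hard-coded in the report.
week1_start <- c(
  "2020" = "2019-12-30",
  "2021" = "2021-01-04",
  "2022" = "2022-01-03"
)

# Convert a week label such as "2020-S21" (assumed format) into a date:
# the start of week 1 plus the week number times 7 days.
week_to_date <- function(test_week) {
  year <- substr(test_week, 1, 4)
  week <- as.numeric(substr(test_week, 7, 8))
  as.Date(unname(week1_start[year])) + 7 * week
}

# Exam session of a date: January to July count for the current year,
# August to December for the following year.
session_of <- function(date) {
  year <- as.numeric(format(date, "%Y"))
  month <- as.numeric(format(date, "%m"))
  ifelse(month <= 7, year, year + 1)
}

dates <- week_to_date(c("2020-S21", "2020-S40", "2021-S01"))
print(dates)
print(session_of(dates))
```

A lookup vector keeps the per-year start dates in one place, which is easier to extend to later years than one `case_when` branch per year.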
covid_in_schools <- covid_in_schools %>%
  mutate(session = case_when(month(test_date) <= 7 ~ year(test_date),
                             month(test_date) > 7 ~ year(test_date)+1))
We add department_fr and region, using the department_fr and region variables from dnb_results. We join the two data sets by department_code. To do this, we first need to make the codes match by removing the first character of department_code in dnb_results.
reg_department <- dnb_results %>%
select(c("department_code", "department_fr", "region")) %>%
unique()
reg_department$department_code <- substring(reg_department$department_code, 2)
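
A small sketch of this code-matching step on made-up rows, using base R only. `merge` with `all.y = TRUE` stands in for the right join used in the report, and the codes and counts are illustrative assumptions.

```r
# Hypothetical lookup in the DNB style: three-character department codes.
reg_department <- data.frame(
  department_code = c("001", "075", "02A"),
  department_fr = c("Ain", "Paris", "Corse du Sud"),
  stringsAsFactors = FALSE
)
# Drop the first character so that "075" becomes "75".
reg_department$department_code <- substring(reg_department$department_code, 2)

# Hypothetical COVID rows with two-character codes; "99" has no match.
covid <- data.frame(
  department_code = c("75", "01", "99"),
  positive = c(120, 15, 3),
  stringsAsFactors = FALSE
)

# all.y = TRUE keeps every row of the lookup, like a right join on department_code.
joined <- merge(covid, reg_department, by = "department_code", all.y = TRUE)
print(joined)
```

With this kind of join, departments without COVID rows are kept with missing values, while COVID rows for unknown codes are dropped.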
covid_in_schools <- right_join(x = covid_in_schools, y = reg_department, by = "department_code")
# Keep only the [11-15) group.
covid_in_schools <- covid_in_schools %>%
  filter(educational_level == "[11-15)")
We can see the final table covid_in_schools below.
We will use the ggplot2 map of France for our visualizations.
map <- map_data("france")
The region variable of this map in fact contains the departments. We rename it “department_fr” to be consistent with the other data sets.
colnames(map)[5] <- "department_fr"
To explore this data set, we start at the national level to analyse the overall trend. We then go down to the regional level to compare the number of students and see which regions perform better. An analysis at the departmental level then digs deeper into the success rate and the rate of each distinction (mention). To complete our analysis, we look at the results by establishment for the best and worst performing establishments in 2020, using their 2006 results for comparison.
Talk about the two graphs (e.g. number of students growing, what is the number of students, pct growth, … ). Mention that the shift in 2017 will be examined in the analysis.
France_results <- dnb_results %>%
#select(session,registered, present, contains("admitted")) %>%
group_by(session) %>%
summarise(registered = sum(registered),
present = sum(present),
admitted = sum(admitted),
admitted_without = sum(admitted_without),
admitted_AB = sum(admitted_AB),
admitted_B = sum(admitted_B),
admitted_TB = sum(admitted_TB),
without_pct = mean(without_pct, na.rm = TRUE),
AB_pct = mean(AB_pct, na.rm = TRUE),
B_pct = mean(B_pct, na.rm = TRUE),
TB_pct = mean(TB_pct, na.rm = TRUE),
success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>%
pivot_longer(c(registered, present,contains("admitted")),
names_to = "Candidates",
values_to = "Number_of_students") %>%
pivot_longer(c(contains("pct")),
names_to = "Mention_type",
               values_to = "Rate")
The graph below was made by first grouping the data by session. A sum was then applied to summarize the variables.
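
In the summary above, the counts are summed per session while the percentage columns are averaged across schools. The sketch below, on made-up numbers, illustrates how the two approaches can differ when schools have very different sizes. It is only an illustration of the distinction, not a change to the method used in this report.

```r
# Hypothetical results for two schools of very different size in one session.
schools <- data.frame(
  session = c(2020, 2020),
  admitted = c(100, 10),
  admitted_TB = c(50, 1)
)
schools$TB_pct <- schools$admitted_TB / schools$admitted * 100

# Unweighted mean of the per-school percentages (every school counts equally).
mean_of_rates <- mean(schools$TB_pct)

# Pooled percentage computed from the summed counts (every candidate counts equally).
totals <- aggregate(cbind(admitted, admitted_TB) ~ session, data = schools, FUN = sum)
pooled_rate <- totals$admitted_TB / totals$admitted * 100

print(mean_of_rates)
print(round(pooled_rate, 2))
```

Which of the two is more appropriate depends on whether the question is about the typical school or about the typical candidate.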
p <- France_results %>%
ggplot(aes(x = session, y = Number_of_students, group = Candidates, color = Candidates))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
ggtitle("National DNB statistics") +
theme_ipsum() +
ylab("Number of students")
ggplotly(p, tooltip = c("x" ,"y"))
We can notice that the
National DNB statistics (rate)
p <- France_results %>%
ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
ggtitle("National DNB statistics") +
theme_ipsum() +
ylab("Rate in %")
ggplotly(p, tooltip = c("x" ,"y"))
Success rate
p <- dnb_results %>%
select(success_rate_pct, region, department) %>%
group_by(region) %>%
summarise(success_rate = mean(success_rate_pct, na.rm = TRUE)) %>%
ggplot(aes(x = region,
y = success_rate,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Success rate") +
ylab("Rate in %")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
One description for everything we see, highlighting the differences (e.g. Bretagne has a higher proportion of students with a mention than without).
p <- dnb_results %>%
select(admitted, region) %>%
group_by(region) %>%
summarise(admitted = sum(admitted)) %>%
ggplot(aes(x = region,
y = admitted,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Number of admitted by region") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_without, region) %>%
group_by(region) %>%
summarise(zero = sum(admitted_without)) %>%
ggplot(aes(x = region,
y = zero,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with zero mention") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_AB, region) %>%
group_by(region) %>%
summarise(AB = sum(admitted_AB)) %>%
ggplot(aes(x = region,
y = AB,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention AB") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_B, region) %>%
group_by(region) %>%
summarise(B = sum(admitted_B)) %>%
ggplot(aes(x = region,
y = B,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention B") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_TB, region) %>%
group_by(region) %>%
summarise(TB = sum(admitted_TB)) %>%
ggplot(aes(x = region,
y = TB,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention TB") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
One description for all of what we see and highlight differences (e.g. Bretagne in B and zero, zero and AB decreasing and B and TB increasing)
y_scale_pct <- scale_y_continuous(limits = c(0, 100))
p <- dnb_results %>%
select(success_rate_pct, region, session) %>%
group_by(region, session) %>%
summarise(success_rate = mean(success_rate_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = success_rate,
color = region,
text = region)) +
geom_line() +
y_scale_pct +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Success rate") +
ylab("Rate in %")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("text","x" ,"y" ))
p <- dnb_results %>%
select(without_pct, region, session) %>%
group_by(region, session) %>%
summarise(zero = mean(without_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = zero,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with zero mention") +
ylab("Rate in %")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("text","x" ,"y" ))
p <- dnb_results %>%
select(AB_pct, region, session) %>%
group_by(region, session) %>%
summarise(AB = mean(AB_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = AB,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention AB") +
ylab("Rate in %")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("text","x" ,"y" ))
p <- dnb_results %>%
select(B_pct, region, session) %>%
group_by(region, session) %>%
summarise(B = mean(B_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = B,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention B") +
ylab("Rate in %")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("text","x" ,"y" ))
p <- dnb_results %>%
select(TB_pct, region, session) %>%
group_by(region, session) %>%
summarise(TB = mean(TB_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = TB,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention TB") +
ylab("Rate in %")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("text","x" ,"y" ))
dnb_pct_dep <- dnb_results %>%
group_by(department, session) %>%
summarise(AB_pct_dep = mean(AB_pct, na.rm = TRUE),
B_pct_dep = mean(B_pct, na.rm = TRUE),
TB_pct_dep = mean(TB_pct, na.rm = TRUE),
without_pct_dep = mean(without_pct, na.rm = TRUE),
success_rate_pct_dep = mean(success_rate_pct, na.rm = TRUE))
datatable(dnb_pct_dep)
Small description for each plot and one analysis encompassing them all (e.g. mention whether the spread of the boxes is small or large, and whether it is increasing or decreasing).
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = success_rate_pct_dep,
group = session,
fill = session,
text = department,
text2 = success_rate_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "", y = "Success rate in %",
title ="Success rate of each Department by session")
ggplotly(p, tooltip = c("text","text2"))
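
The box of a box plot spans the first to the third quartile, so the dispersion across departments can be quantified with the quartiles and the interquartile range. A minimal base-R sketch on made-up department rates:

```r
# Hypothetical department-level success rates (in %) for one session.
success_rate_pct_dep <- c(80, 85, 90, 95, 100)

# Quartiles drawn by the box, and the interquartile range (height of the box).
quartiles <- quantile(success_rate_pct_dep, probs = c(0.25, 0.5, 0.75))
spread <- IQR(success_rate_pct_dep)

print(quartiles)
print(spread)
```

Comparing this range from one session to the next gives a number to go with the visual impression of boxes getting narrower or wider.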
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = without_pct_dep,
group = session,
fill = session,
text = department,
text2 = without_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with no mention in %",
title ="Rate of students with no mention of each Department by session")
ggplotly(p, tooltip = c("text","text2"))
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = B_pct_dep,
group = session,
fill = session,
text = department,
text2 = B_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with mention Bien in %",
title ="Rate of students with mention Bien of each Department by session")
ggplotly(p, tooltip = c("text","text2"))
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = AB_pct_dep,
group = session,
fill = session,
text = department,
text2 = AB_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with mention Assez Bien in %",
title ="Rate of students with mention Assez Bien of each Department by session")
ggplotly(p, tooltip = c("text","text2"))
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = TB_pct_dep,
group = session,
fill = session,
text = department,
text2 = TB_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with mention Très Bien in %",
title ="Rate of students with mention Très Bien of each Department by session")
ggplotly(p, tooltip = c("text","text2"))
Analyse first the differences between 2020 and 2006, then discuss the differences between Paris and Eure-et-Loir.
First analysis: the best performing department, with the highest rate of TB -> Paris 2020
Paris <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
filter(department == "PARIS", session == "2020") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Paris$Mention_type <- factor(Paris$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Paris %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Parisian establishments in 2020")
ggplotly(p, tooltip = c("text","text2"))
Best performing department, with the highest rate of TB -> Paris 2006
Paris <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
filter(department == "PARIS", session == "2006") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Paris$Mention_type <- factor(Paris$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Paris %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Parisian establishments in 2006")
ggplotly(p, tooltip = c("text","text2"))
Lowest performing department in TB rate and “best” in zero mention rate -> Eure-et-Loir 2020
Eure <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
dplyr::filter(department == "EURE-ET-LOIR", session == "2020") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Eure$Mention_type <- factor(Eure$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Eure %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Eure et Loir establishments in 2020")
ggplotly(p, tooltip = c("text","text2"))
Same analysis for the earlier session -> Eure-et-Loir 2006
Eure <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
dplyr::filter(department == "EURE-ET-LOIR", session == "2006") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Eure$Mention_type <- factor(Eure$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Eure %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Eure et Loir establishments in 2006")
ggplotly(p, tooltip = c("text","text2"))
map_theme <- theme(title=element_text(),
plot.title=element_text(margin=margin(20,20,20,20), size=18, hjust = 0.5),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.grid.major= element_blank(),
panel.background= element_blank())
p <- ggplot() +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") +
geom_point(data = establishment_24, aes(x = longitude , y = latitude, text2= establishment), size = 0.5)+
coord_map() +
scale_fill_viridis(name = "Average success rate")+
labs(x = "",
y = "",
title = "Average success rate, 2006-2021") +
map_theme
#> Warning in geom_point(data = establishment_24, aes(x = longitude, y =
#> latitude, : Ignoring unknown aesthetics: text2
ggplotly(p, tooltip = c("text2") )
The first visualisation we wanted was a barplot showing the evolution of the total number of single-parent families in France from 2007 to 2018. To do this, we isolated the sing_par and session variables in a dataset called sp1_, then summed the number of single-parent families per session with “sum”. To remove the years that did not interest us, we used the filter function. We were then able to use geom_col.
sp1_ <- single_parent %>%
select(c("session", "sing_par")) %>%
group_by(session)%>%
summarise (sing_par = sum(sing_par, na.rm = TRUE))
barplot_pos <- ggplot(data = sp1_, aes(x = session,
y = sing_par,
fill = session))+
geom_col()+ # geom_col already uses the y values as bar heights, so no stat argument is needed
scale_x_discrete(labels=as.character(sp1_$session), breaks= sp1_$session)+
scale_fill_gradientn(colors = viridis(6))+
#scale_fill_viridis() +
theme_ipsum()+ # apply the complete theme first so it does not override the axis tweak below
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, color = "black"))
#geom_text(aes(label=sing_par), vjust = 1.6, color = "white", size = 2)
barplot_pos
Based on the barplot, we can make an overall analysis of the evolution
of single-parent families in France over the years. Although several
years are missing, we can clearly see that the number of single-parent
families has kept increasing since 2007: there were 2,427,110
single-parent families in 2007 and 3,031,823 in 2018. In just over 10
years, this is an increase of about 25%.
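The relative increase between the two totals quoted above can be checked directly. This is a minimal sketch in base R; the names n_2007, n_2018 and pct_increase are illustrative only and are not part of the project code.

```r
# Totals of single-parent families quoted in the text
n_2007 <- 2427110
n_2018 <- 3031823

# Relative increase between the two sessions, in percent
pct_increase <- (n_2018 - n_2007) / n_2007 * 100
round(pct_increase, 1)
#> [1] 24.9
```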
We now build the data set needed to create the maps. We join the
single_parent and map data sets on department_fr.
jmap_sp<- left_join(x = map[,-6], y = single_parent, by = "department_fr")
p <-jmap_sp %>%
filter(session == 2007) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= sing_par), color = "black") +
scale_fill_viridis(name = "Number of single parent families") +
labs(x = "",
y = "",
title = "Single parent families in 2007")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
sing_dnb <- left_join(x = single_parent, y = dnb_results_dep, by = c("department_fr", "session"))
sing_dnb <- sing_dnb %>%
mutate(single_parent_per_student_admitted = sing_par/admitted)
#Join with the map data set for the mapping.
sing_dnb_map <- left_join(x = map[,-6], y = sing_dnb, by = "department_fr")
p <-sing_dnb_map %>%
filter(session == 2007) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student admitted") +
labs(x = "",
y = "",
title = "Single parent families per student admitted in 2007")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2008) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student admitted") +
labs(x = "",
y = "",
title = "Single parent families per student admitted in 2008")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2012) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student admitted") +
labs(x = "",
y = "",
title = "Single parent families per student admitted in 2012")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2013) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student admitted") +
labs(x = "",
y = "",
title = "Single parent families per student admitted in 2013")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2017) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student admitted") +
labs(x = "",
y = "",
title = "Single parent families per student admitted in 2017")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2018) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student admitted") +
labs(x = "",
y = "",
title = "Single parent families per student admitted in 2018")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
First, we wanted an overview of COVID-19 positive cases in France over time, so we used ggplot with test_date on the x-axis and positive on the y-axis.
p <- covid_in_schools %>%
select(positive, test_date) %>%
group_by(test_date) %>%
summarise(positive = sum(positive)) %>%
ggplot( mapping = aes(x= test_date, y = positive)) +
geom_line() +
labs(title = "French Covid-19 cases from the age group 11 to 15 years old (2020-2022)", x = "Date", y = "Number of cases")+
theme_ipsum()
ggplotly(p, tooltip = c("x","y"))
p <- covid_in_schools %>%
select(c("department_code", "positive", "session", "region", "department_fr", "test_date")) %>%
group_by(region, test_date) %>%
summarise(positive = sum(positive)) %>%
ggplot() +
geom_line(mapping = aes(x = test_date, y = positive, color = region))+
scale_color_viridis(discrete = TRUE) +
ggtitle("Covid-19 cases from the age group 11 to 15 years old by region (2020-2022)") +
theme_ipsum()
ggplotly(p, tooltip = c("x","y", "color"))
covidpos_dep <- covid_in_schools %>%
select(c("department_code", "incidence_rate", "session", "department_fr" )) %>%
group_by(department_fr, session) %>%
summarise(incidence_rate = mean(incidence_rate, na.rm = TRUE))
covidpos_dep <- left_join(x = map[,-6], y = covidpos_dep)
p <- covidpos_dep %>%
filter(session == 2020) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2020")+
map_theme
ggplotly(p)
p <- covidpos_dep %>%
filter(session == 2021) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2021")+
map_theme
ggplotly(p)
p <- covidpos_dep %>%
filter(session == 2022) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2022")+
map_theme
ggplotly(p)
p <- covidpos_dep %>%
filter(session == 2023) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2023")+
map_theme
ggplotly(p)
Regression trial
lm1 <- lm(dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct + dnb_results$AB_pct + dnb_results$without_pct)
summary(lm1)
#>
#> Call:
#> lm(formula = dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct +
#> dnb_results$AB_pct + dnb_results$without_pct)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.71e-09 0.00e+00 0.00e+00 0.00e+00 2.55e-11
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.00e+02 1.17e-13 8.53e+14 <2e-16 ***
#> dnb_results$without_pct -1.00e+00 1.26e-15 -7.95e+14 <2e-16 ***
#> dnb_results$B_pct -1.00e+00 2.25e-15 -4.44e+14 <2e-16 ***
#> dnb_results$AB_pct -1.00e+00 1.50e-15 -6.66e+14 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.74e-12 on 135154 degrees of freedom
#> (29 observations deleted due to missingness)
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 3.04e+29 on 3 and 135154 DF, p-value: <2e-16
Clustering trial
dnb_pct_dep <- dnb_results %>%
group_by(department, session) %>%
summarise(AB_pct_dep = mean(AB_pct, na.rm = TRUE),
B_pct_dep = mean(B_pct, na.rm = TRUE),
TB_pct_dep = mean(TB_pct, na.rm = TRUE),
without_pct_dep = mean(without_pct, na.rm = TRUE),
success_rate_pct_dep = mean(success_rate_pct, na.rm = TRUE))
pairs(dnb_pct_dep[2:6])
distance <- dist(dnb_pct_dep[3:7]) # percentage columns only; department names cannot be coerced to numbers
mydata.hclust <- hclust(distance)
plot(mydata.hclust)
Creation of the dataset used for the map
result <- dnb_results %>%
select(department_fr, success_rate_pct) %>%
group_by(department_fr) %>%
summarise(success_rate = mean(success_rate_pct, na.rm = TRUE))
Join the map from ggplot and our new dataset
result_map <- left_join(x = map[,-6], y = result)
Plot